Hyperparameter Tuning with Optuna

With great models comes the great problem of optimizing hyperparameters [Tha20]. Once a good search algorithm is established for hyperparameter optimization, the task becomes an engineering problem (footnote 1). Hence, we will explore an open-source library that offers a framework for solving this task.

../_images/optuna.png

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to its define-by-run API, code written with Optuna is highly modular, and the user can construct the search spaces for the hyperparameters dynamically.

Basics with scikit-learn

Optuna is a black-box optimizer: it needs only an objective function, that is, any function that returns a numerical value, to evaluate the performance of a set of hyperparameters and to decide where to sample in upcoming trials. An optimization problem is framed in the Optuna API using two basic concepts: study and trial.

A study is conceptually an optimization based on an objective function, while a trial is a single execution of an objective function. The combination of hyperparameters for each trial is sampled according to some sampling algorithm defined by the study.

In the following code example, the search space is constructed within imperative Python code, e.g. inside conditionals or loops. By contrast, recall that for GridSearchCV and RandomizedSearchCV in scikit-learn, we had to define the entire search space before running the search algorithm.

!pip install optuna
import optuna
import pandas as pd
from sklearn import ensemble, svm
from sklearn import datasets
from sklearn import model_selection
from functools import partial
import joblib


# [1] Define an objective function to be maximized.
def objective(trial, X, y):
    
    # [2] Suggest values for the hyperparameters using trial object.
    clf_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if clf_name == 'SVC':
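        # Log-uniform float; suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        svc_c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)
        clf = svm.SVC(C=svc_c, gamma='auto')
    else:
        # Sampled as a log-uniform float (as the log below shows), then truncated to an int.
        rf_max_depth = int(trial.suggest_float('rf_max_depth', 2, 32, log=True))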
        clf = ensemble.RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)

    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()

# [3] Create a study object and optimize the objective function.
X, y = datasets.load_breast_cancer(return_X_y=True)
study = optuna.create_study(direction="maximize")
study.optimize(partial(objective, X=X, y=y), n_trials=5)
Collecting optuna
  Using cached optuna-2.9.1-py3-none-any.whl (302 kB)
Successfully installed Mako-1.1.5 alembic-1.7.3 autopage-0.4.0 cliff-3.9.0 cmaes-0.8.2 cmd2-2.2.0 colorama-0.4.4 colorlog-6.4.1 optuna-2.9.1 pbr-5.6.0 pyperclip-1.8.2 stevedore-3.4.0
[I 2021-09-23 10:37:22,628] A new study created in memory with name: no-name-cf61fd1f-c292-4d08-8331-b61df2b285b5
[I 2021-09-23 10:37:23,768] Trial 0 finished with value: 0.9402732494954199 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 2.3445281943268284}. Best is trial 0 with value: 0.9402732494954199.
[I 2021-09-23 10:37:23,896] Trial 1 finished with value: 0.9718987734823784 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 7.197952463169158}. Best is trial 1 with value: 0.9718987734823784.
[I 2021-09-23 10:37:23,989] Trial 2 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 0.051722222321909525}. Best is trial 1 with value: 0.9718987734823784.
[I 2021-09-23 10:37:24,061] Trial 3 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 9.41442891449573e-09}. Best is trial 1 with value: 0.9718987734823784.
[I 2021-09-23 10:37:24,147] Trial 4 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 0.0011946516826625228}. Best is trial 1 with value: 0.9718987734823784.

The study object stores the result of evaluating the objective for each trial, where a trial is essentially one choice of hyperparameters to evaluate. In the above study, the problem of model selection is framed as a hyperparameter optimization problem: we choose between an SVM-based classifier and a Random Forest.

study.trials_dataframe().head()
   number     value              datetime_start           datetime_complete                duration  params_classifier  params_rf_max_depth  params_svc_c     state
0       0  0.940273  2021-09-23 10:37:22.632861  2021-09-23 10:37:23.767644  0 days 00:00:01.134783       RandomForest             2.344528           NaN  COMPLETE
1       1  0.971899  2021-09-23 10:37:23.770200  2021-09-23 10:37:23.896625  0 days 00:00:00.126425       RandomForest             7.197952           NaN  COMPLETE
2       2  0.627418  2021-09-23 10:37:23.900134  2021-09-23 10:37:23.989031  0 days 00:00:00.088897                SVC                  NaN  5.172222e-02  COMPLETE
3       3  0.627418  2021-09-23 10:37:23.990912  2021-09-23 10:37:24.060850  0 days 00:00:00.069938                SVC                  NaN  9.414429e-09  COMPLETE
4       4  0.627418  2021-09-23 10:37:24.062551  2021-09-23 10:37:24.147696  0 days 00:00:00.085145                SVC                  NaN  1.194652e-03  COMPLETE

Fine tuning Random Forest

Here we focus on tuning a single Random Forest model. We then plot the accuracy for each pair of hyperparameters.

def objective(trial):
    
    max_depth = trial.suggest_int('max_depth', 2, 128, log=True)    
    max_features = trial.suggest_float('max_features', 0.1, 1.0)    
    n_estimators = trial.suggest_int('n_estimators', 100, 800)
    
    clf = ensemble.RandomForestClassifier(
        max_depth=max_depth,
        n_estimators=n_estimators,
        max_features=max_features,
        random_state=42)   
    
    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60)
[I 2021-09-23 06:42:48,998] A new study created in memory with name: no-name-21890d05-2176-4274-8c50-b293d30f31e3
[I 2021-09-23 06:42:51,527] Trial 0 finished with value: 0.9543393882937432 and parameters: {'max_depth': 3, 'max_features': 0.9705595686843378, 'n_estimators': 100}. Best is trial 0 with value: 0.9543393882937432.
[I 2021-09-23 06:42:55,482] Trial 1 finished with value: 0.9596180717279925 and parameters: {'max_depth': 13, 'max_features': 0.46775203542826993, 'n_estimators': 246}. Best is trial 1 with value: 0.9596180717279925.
[I 2021-09-23 06:43:00,795] Trial 2 finished with value: 0.9578792113025927 and parameters: {'max_depth': 23, 'max_features': 0.9062990961333452, 'n_estimators': 155}. Best is trial 1 with value: 0.9596180717279925.
[I 2021-09-23 06:43:15,038] Trial 3 finished with value: 0.95960254618848 and parameters: {'max_depth': 5, 'max_features': 0.2578188083038355, 'n_estimators': 749}. Best is trial 1 with value: 0.9596180717279925.
[I 2021-09-23 06:43:30,242] Trial 4 finished with value: 0.9596180717279925 and parameters: {'max_depth': 37, 'max_features': 0.749490296122596, 'n_estimators': 602}. Best is trial 1 with value: 0.9596180717279925.
[I 2021-09-23 06:43:36,503] Trial 5 finished with value: 0.9613879832324173 and parameters: {'max_depth': 22, 'max_features': 0.6506965527914478, 'n_estimators': 416}. Best is trial 5 with value: 0.9613879832324173.
[I 2021-09-23 06:43:43,341] Trial 6 finished with value: 0.9525694767893185 and parameters: {'max_depth': 3, 'max_features': 0.1995450947329181, 'n_estimators': 794}. Best is trial 5 with value: 0.9613879832324173.
[I 2021-09-23 06:43:51,923] Trial 7 finished with value: 0.9596180717279925 and parameters: {'max_depth': 103, 'max_features': 0.7122053899029194, 'n_estimators': 544}. Best is trial 5 with value: 0.9613879832324173.
[I 2021-09-23 06:43:58,840] Trial 8 finished with value: 0.9631113181183046 and parameters: {'max_depth': 54, 'max_features': 0.19194668972744688, 'n_estimators': 731}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:01,250] Trial 9 finished with value: 0.9613724576929048 and parameters: {'max_depth': 16, 'max_features': 0.4894931002552039, 'n_estimators': 179}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:04,481] Trial 10 finished with value: 0.95960254618848 and parameters: {'max_depth': 95, 'max_features': 0.11170498668801267, 'n_estimators': 372}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:09,013] Trial 11 finished with value: 0.9613569321533924 and parameters: {'max_depth': 45, 'max_features': 0.36449069283406643, 'n_estimators': 400}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:19,093] Trial 12 finished with value: 0.9596180717279925 and parameters: {'max_depth': 8, 'max_features': 0.6693668460810014, 'n_estimators': 610}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:25,745] Trial 13 finished with value: 0.9596180717279925 and parameters: {'max_depth': 46, 'max_features': 0.5815778613171878, 'n_estimators': 472}. Best is trial 8 with value: 0.9631113181183046.
[I 2021-09-23 06:44:29,124] Trial 14 finished with value: 0.9631423691973297 and parameters: {'max_depth': 26, 'max_features': 0.33234183683555163, 'n_estimators': 308}. Best is trial 14 with value: 0.9631423691973297.
[I 2021-09-23 06:44:32,064] Trial 15 finished with value: 0.95960254618848 and parameters: {'max_depth': 70, 'max_features': 0.34448905500108007, 'n_estimators': 261}. Best is trial 14 with value: 0.9631423691973297.
[I 2021-09-23 06:44:37,983] Trial 16 finished with value: 0.9596025461884802 and parameters: {'max_depth': 33, 'max_features': 0.1084223731260328, 'n_estimators': 686}. Best is trial 14 with value: 0.9631423691973297.
[I 2021-09-23 06:44:41,886] Trial 17 finished with value: 0.9613569321533924 and parameters: {'max_depth': 9, 'max_features': 0.3451621038253865, 'n_estimators': 343}. Best is trial 14 with value: 0.9631423691973297.
[I 2021-09-23 06:44:47,055] Trial 18 finished with value: 0.9631268436578171 and parameters: {'max_depth': 68, 'max_features': 0.23707504951615715, 'n_estimators': 510}. Best is trial 14 with value: 0.9631423691973297.
[I 2021-09-23 06:44:53,167] Trial 19 finished with value: 0.9648812296227295 and parameters: {'max_depth': 123, 'max_features': 0.4284868695999386, 'n_estimators': 501}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:44:56,155] Trial 20 finished with value: 0.9490607048594939 and parameters: {'max_depth': 2, 'max_features': 0.4858918861606428, 'n_estimators': 303}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:01,968] Trial 21 finished with value: 0.9613879832324173 and parameters: {'max_depth': 115, 'max_features': 0.39386785749654124, 'n_estimators': 495}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:07,214] Trial 22 finished with value: 0.9613724576929048 and parameters: {'max_depth': 68, 'max_features': 0.22246748754835521, 'n_estimators': 541}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:12,733] Trial 23 finished with value: 0.95960254618848 and parameters: {'max_depth': 121, 'max_features': 0.27140166339787225, 'n_estimators': 465}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:19,591] Trial 24 finished with value: 0.95960254618848 and parameters: {'max_depth': 73, 'max_features': 0.28521774043551834, 'n_estimators': 595}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:23,744] Trial 25 finished with value: 0.9613724576929048 and parameters: {'max_depth': 31, 'max_features': 0.41460082147718447, 'n_estimators': 346}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:31,008] Trial 26 finished with value: 0.9596180717279925 and parameters: {'max_depth': 23, 'max_features': 0.5486258028858528, 'n_estimators': 534}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:36,144] Trial 27 finished with value: 0.9578481602235677 and parameters: {'max_depth': 80, 'max_features': 0.43353643010332477, 'n_estimators': 413}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:39,317] Trial 28 finished with value: 0.9613724576929048 and parameters: {'max_depth': 56, 'max_features': 0.3148553844697188, 'n_estimators': 285}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:40,902] Trial 29 finished with value: 0.9631268436578171 and parameters: {'max_depth': 15, 'max_features': 0.16635263953427581, 'n_estimators': 174}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:49,782] Trial 30 finished with value: 0.9596180717279925 and parameters: {'max_depth': 89, 'max_features': 0.554596584580045, 'n_estimators': 658}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:50,958] Trial 31 finished with value: 0.9543393882937432 and parameters: {'max_depth': 10, 'max_features': 0.18035825668984193, 'n_estimators': 124}. Best is trial 19 with value: 0.9648812296227295.
[I 2021-09-23 06:45:52,746] Trial 32 finished with value: 0.9666356155876418 and parameters: {'max_depth': 16, 'max_features': 0.14768644779668522, 'n_estimators': 192}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:45:55,005] Trial 33 finished with value: 0.9648812296227295 and parameters: {'max_depth': 28, 'max_features': 0.2510243346455137, 'n_estimators': 219}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:45:59,103] Trial 34 finished with value: 0.9596491228070174 and parameters: {'max_depth': 19, 'max_features': 0.9745896683213247, 'n_estimators': 217}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:00,268] Trial 35 finished with value: 0.9613569321533924 and parameters: {'max_depth': 12, 'max_features': 0.30501879050629477, 'n_estimators': 103}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:02,213] Trial 36 finished with value: 0.9631113181183046 and parameters: {'max_depth': 7, 'max_features': 0.15632147396180562, 'n_estimators': 219}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:05,875] Trial 37 finished with value: 0.9578636857630801 and parameters: {'max_depth': 6, 'max_features': 0.8281693351932028, 'n_estimators': 220}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:09,494] Trial 38 finished with value: 0.9613414066138798 and parameters: {'max_depth': 4, 'max_features': 0.44095789977754735, 'n_estimators': 313}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:11,264] Trial 39 finished with value: 0.9648967551622418 and parameters: {'max_depth': 28, 'max_features': 0.37846712295417323, 'n_estimators': 151}. Best is trial 32 with value: 0.9666356155876418.
[I 2021-09-23 06:46:12,765] Trial 40 finished with value: 0.9666511411271541 and parameters: {'max_depth': 12, 'max_features': 0.24342019325477898, 'n_estimators': 144}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:14,236] Trial 41 finished with value: 0.9666511411271541 and parameters: {'max_depth': 18, 'max_features': 0.25686751699207633, 'n_estimators': 142}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:15,732] Trial 42 finished with value: 0.9631268436578171 and parameters: {'max_depth': 18, 'max_features': 0.25601226916584446, 'n_estimators': 149}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:16,925] Trial 43 finished with value: 0.9560937742586555 and parameters: {'max_depth': 12, 'max_features': 0.13252924130363736, 'n_estimators': 139}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:19,054] Trial 44 finished with value: 0.9613879832324173 and parameters: {'max_depth': 20, 'max_features': 0.3926218691035751, 'n_estimators': 182}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:20,062] Trial 45 finished with value: 0.9596180717279925 and parameters: {'max_depth': 27, 'max_features': 0.20070379523636273, 'n_estimators': 103}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:21,928] Trial 46 finished with value: 0.9648812296227295 and parameters: {'max_depth': 39, 'max_features': 0.2262102823158007, 'n_estimators': 192}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:23,681] Trial 47 finished with value: 0.9648812296227295 and parameters: {'max_depth': 42, 'max_features': 0.21154640307009018, 'n_estimators': 179}. Best is trial 40 with value: 0.9666511411271541.
[I 2021-09-23 06:46:27,221] Trial 48 finished with value: 0.968421052631579 and parameters: {'max_depth': 13, 'max_features': 0.5150620025995097, 'n_estimators': 268}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:30,613] Trial 49 finished with value: 0.968421052631579 and parameters: {'max_depth': 13, 'max_features': 0.5331741027433536, 'n_estimators': 263}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:34,135] Trial 50 finished with value: 0.9578792113025927 and parameters: {'max_depth': 13, 'max_features': 0.6088892609241104, 'n_estimators': 249}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:37,691] Trial 51 finished with value: 0.9666511411271541 and parameters: {'max_depth': 10, 'max_features': 0.5005211447500874, 'n_estimators': 274}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:41,115] Trial 52 finished with value: 0.9666511411271541 and parameters: {'max_depth': 11, 'max_features': 0.5270075737803348, 'n_estimators': 256}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:44,625] Trial 53 finished with value: 0.968421052631579 and parameters: {'max_depth': 11, 'max_features': 0.5327507689968285, 'n_estimators': 264}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:49,679] Trial 54 finished with value: 0.9613879832324173 and parameters: {'max_depth': 7, 'max_features': 0.6475633789841244, 'n_estimators': 346}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:53,394] Trial 55 finished with value: 0.9613879832324173 and parameters: {'max_depth': 10, 'max_features': 0.5337701892828655, 'n_estimators': 271}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:46:57,068] Trial 56 finished with value: 0.9631113181183046 and parameters: {'max_depth': 5, 'max_features': 0.5162466114601653, 'n_estimators': 288}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:47:03,202] Trial 57 finished with value: 0.9613879832324173 and parameters: {'max_depth': 9, 'max_features': 0.7434622439009879, 'n_estimators': 384}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:47:07,799] Trial 58 finished with value: 0.9631268436578171 and parameters: {'max_depth': 14, 'max_features': 0.5941045922128059, 'n_estimators': 332}. Best is trial 48 with value: 0.968421052631579.
[I 2021-09-23 06:47:11,341] Trial 59 finished with value: 0.9631578947368421 and parameters: {'max_depth': 8, 'max_features': 0.6420539195421321, 'n_estimators': 239}. Best is trial 48 with value: 0.968421052631579.
study.best_params
{'max_depth': 13, 'max_features': 0.5150620025995097, 'n_estimators': 268}
study.best_value
0.968421052631579
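The log=True flag on max_depth spreads trials evenly across orders of magnitude rather than across raw values, so depths between 2 and 16 receive about as many trials as depths between 16 and 128. The following is a rough standard-library sketch of that idea and not Optuna's actual implementation; the function name and the rounding step are assumptions.

```python
import math
import random


def sample_log_int(low, high, rng):
    """Draw an integer in [low, high], uniform on a log scale (illustrative only)."""
    u = rng.uniform(math.log(low), math.log(high))
    return min(high, max(low, round(math.exp(u))))


rng = random.Random(0)
depths = [sample_log_int(2, 128, rng) for _ in range(10_000)]

# 16 is the geometric midpoint of 2 and 128, so roughly half of the draws
# should fall at or below it.
share_small = sum(d <= 16 for d in depths) / len(depths)
print(min(depths), max(depths), round(share_small, 2))
```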

Sampling algorithms

import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=3)

def plot_results(study, p1, p2, j, cb):
    study.trials_dataframe().plot(
        kind='scatter', ax=axes[j], x=p1, y=p2,
        c='value', s=60, cmap=plt.get_cmap("jet"), 
        colorbar=cb, label="accuracy", figsize=(16, 4)
    )

plot_results(study, 'params_max_depth',    'params_n_estimators', j=0, cb=False)
plot_results(study, 'params_max_depth',    'params_max_features', j=1, cb=False)
plot_results(study, 'params_n_estimators', 'params_max_features', j=2, cb=True);
../_images/hyperopt-optuna2_16_0.png

Figure. TPE in action. Optuna uses the Tree-structured Parzen Estimator (TPE) [BBBK11], a form of Bayesian optimization, as its default sampler. Observe that the hyperparameter space is searched more efficiently than with random search, with the sampler choosing points closer to previous good results. Samplers are specified when creating a study:

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())

From the docs:

On each trial, for each parameter, TPE fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It chooses the parameter value x that maximizes the ratio l(x)/g(x).
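The selection rule in this quote can be sketched for a single parameter with the standard library alone. This is a toy illustration under simplifying assumptions and not Optuna's implementation: the helper names, the fixed bandwidth, the equal-weight Gaussian mixture, and the 25% split between "best" and "remaining" values are all made up for the example.

```python
import math
import random


def mixture_density(centers, bandwidth):
    """Equal-weight mixture of Gaussians centred on `centers`."""
    norm = len(centers) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers) / norm

    return density


def tpe_like_suggest(history, low, high, rng, gamma=0.25, n_candidates=24):
    """Pick the next value of one parameter from (value, score) pairs; higher score is better."""
    ranked = sorted(history, key=lambda pair: pair[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    rest = [x for x, _ in ranked[n_good:]]
    bandwidth = (high - low) / 10.0
    l_pdf = mixture_density(good, bandwidth)  # l(x): density of the best values
    g_pdf = mixture_density(rest, bandwidth)  # g(x): density of the remaining values
    # Draw candidates from l(x), clipped to the search range.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    # Keep the candidate with the largest ratio l(x) / g(x).
    return max(candidates, key=lambda x: l_pdf(x) / (g_pdf(x) + 1e-12))


# Toy history: the score peaks at x = 3.
history = [(float(x), -(x - 3.0) ** 2) for x in range(11)]
suggestion = tpe_like_suggest(history, low=0.0, high=10.0, rng=random.Random(0))
print(suggestion)
```

The suggestion lands near the values that scored well so far. Note that the sketch looks at one parameter in isolation, just as the quoted description does.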

Thus, TPE samples every hyperparameter independently: no explicit hyperparameter interactions are considered when sampling future trials, although other parameters implicitly affect the objective value. Optuna also implements our old friends random search and grid search in the following samplers:

  • optuna.samplers.GridSampler

  • optuna.samplers.RandomSampler

Results from the paper [ASY+19]:

../_images/fig9-optuna.png
../_images/fig10-optuna.png
../_images/optuna-results.png


TPE+CMA-ES sampling can be implemented as follows:

sampler = optuna.samplers.CmaEsSampler(
    warn_independent_sampling=False,
    independent_sampler=optuna.samplers.TPESampler()
)

This uses the CMA-ES algorithm [Han16] with TPE for searching dynamically constructed hyperparameters (as CMA-ES requires that parameters are specified prior to the optimization).

Visualizations

First define a helper function for displaying plotly plots as HTML.

from IPython.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
config={'showLink': False, 'displayModeBar': False}
fig_count = 0

# See https://github.com/executablebooks/jupyter-book/issues/93 <!>
# Solves issue of having blank plotly plots in the build. No need to
# save the generated HTML files. Probably embedded into the notebook.
def plot_html(fig):
    global fig_count
    plot(fig, filename=f'optuna-{fig_count}.html', config=config)
    display(HTML(f'optuna-{fig_count}.html'))
    fig_count += 1

Optuna provides visualization functions in the optuna.visualization module (footnote 2). The following plot shows the best objective value found as the trials progress. The increasing trend in accuracy indicates that the TPE sampler is working well, i.e. the search algorithm learns from previous trials.

plot_html(optuna.visualization.plot_optimization_history(study))
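The "best value so far" traced in this plot is a running maximum of the trial values, since the study maximizes accuracy. A minimal sketch using the first six trial values from the log above, rounded to four decimals (the variable names are ours):

```python
from itertools import accumulate

# Objective values of trials 0 to 5, rounded from the log above.
values = [0.9543, 0.9596, 0.9579, 0.9596, 0.9596, 0.9614]

# For a maximization study, "best value so far" is a running maximum.
best_so_far = list(accumulate(values, max))
print(best_so_far)
```

A running maximum can never decrease, so the informative part of the plot is how quickly and how often it steps up.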

The parallel coordinate plot gives us a feel for how the hyperparameters interact. For instance, max_features around 0.5 with n_estimators around 280 and max_depth around 20 generally performs well. This setting includes the best performing hyperparameters. To isolate subsets of lines, use the interactive capabilities of the plot below by dragging on each axis to restrict it. See the figure immediately below.

plot_html(optuna.visualization.plot_parallel_coordinate(study))
../_images/optuna-restrict-rf.png

Using sliders to restrict values for certain parameters.

Slice plots project the path of the optimizer through the hyperparameter space onto each dimension, with each point placed along the \(y\)-axis according to its objective value. A large spread of dark dots indicates that a large range of values of that hyperparameter is feasible even at later stages. Meanwhile, a small spread means that the sampler focuses on a small part of the search space; in this case, other hyperparameters implicitly improve the objective. For example, the parameter max_features is explored over a wide range even in later trials. Hence, we think of this hyperparameter as important. Indeed, the importance plot below supports this.

plot_html(optuna.visualization.plot_slice(study, params=['n_estimators', 'max_depth', 'max_features']))

By default, the hyperparameter importance evaluator in Optuna is optuna.importance.FanovaImportanceEvaluator. This takes as input performance data gathered with different hyperparameter settings of the algorithm, fits a random forest to capture the relationship between hyperparameters and performance, and then applies functional ANOVA to assess how important each of the hyperparameters and each low-order interaction of hyperparameters is to performance [HHLB14]. From the docs:

The performance of fANOVA depends on the prediction performance of the underlying random forest model. In order to obtain high prediction performance, it is necessary to cover a wide range of the hyperparameter search space. It is recommended to use an exploration-oriented sampler such as RandomSampler.

fig = optuna.visualization.plot_param_importances(study)
fig.update_layout(width=600, height=350)
plot_html(fig)

To visualize interactions of any pair of hyperparameters, we use contour plots. The contour plots indicate regions of high and low objective value.

fig = optuna.visualization.plot_contour(study, params=["max_depth", "max_features"])
fig.update_layout(width=550, height=500)
plot_html(fig)

Neural networks

As noted above, we should always perform tuning within a cross-validation framework. However, with neural networks, 5-fold CV would require too much compute time and hence too many resources, e.g. GPU usage. Instead, we perform tuning on a hold-out validation set and hope for the best.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import Dataset, DataLoader

from sklearn import model_selection
from sklearn.datasets import fetch_openml

from tqdm import tqdm
import optuna
import numpy as np

Define a simple network.

class MLPClassifier(nn.Module):
    """
    Neural network with multiple hidden fully-connected layers with ReLU 
    activation and dropout.
    """
    
    def __init__(self, input_size, num_classes, n_layers, out_features, drop_rate):
        super().__init__()
        layers = []
        in_features = input_size
        for i in range(n_layers):

            m = nn.Linear(in_features, out_features[i])
            nn.init.kaiming_normal_(m.weight)
            nn.init.constant_(m.bias, 0)

            layers.append(m)
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(drop_rate))

            in_features = out_features[i]

        layers.append(nn.Linear(in_features, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

We also define a Dataset class for MNIST.

class MNISTDataset(Dataset):
    def __init__(self, features, targets, transform=None):
        self.features = features
        self.targets = targets
        self.transform = transform
        
    def __len__(self):
        return self.features.shape[0]
    
    def __getitem__(self, i):
        X = self.features[i, :]
        y = self.targets[i]
        
        if self.transform is not None:
            X = self.transform(X)
            
        return X, y

Define a trainer for the neural network model. This will handle all loss and metric evaluation, as well as backpropagation.

class Engine:
    """Neural network trainer."""
    
    def __init__(self, model, device, optimizer):
        self.model = model
        self.device = device
        self.optimizer = optimizer 

    @staticmethod
    def loss_fn(outputs, targets):
        return nn.CrossEntropyLoss()(outputs, targets)
        
    def train(self, data_loader):
        """Train model for one epoch. Return the mean train loss over batches."""
        
        self.model.train()
        loss = 0
        for i, (data, targets) in enumerate(data_loader):
            data = data.to(self.device).reshape(data.shape[0], -1).float()
            targets = targets.to(self.device).long()
            
            # Forward pass
            outputs = self.model(data)
            J = self.loss_fn(outputs, targets)
            
            # Backward pass
            self.optimizer.zero_grad()
            J.backward()
            self.optimizer.step()

            # Running mean of the batch losses
            loss += (J.detach().item() - loss) / (i + 1)

        return loss


    def eval(self, data_loader):
        """Return validation loss and validation accuracy."""
        
        self.model.eval()
        num_correct = 0
        num_samples = 0
        loss = 0.0
        with torch.no_grad():
            for i, (data, targets) in enumerate(data_loader):
                data = data.to(self.device).float()
                targets = targets.to(self.device)
                
                # Forward pass
                data = data.reshape(data.shape[0], -1)
                out = self.model(data)
                J = self.loss_fn(out, targets)
                _, preds = out.max(dim=1)

                # Running mean of the batch losses; cumulative counts for accuracy
                loss += (J.detach().item() - loss) / (i + 1)
                num_correct += (preds == targets).sum().item()
                num_samples += preds.shape[0]

        acc = num_correct / num_samples
        return loss, acc
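
The update `loss += (J - loss) / (i + 1)` used in both `train` and `eval` keeps a running mean of the per-batch losses rather than a sum. As a minimal standalone sketch (the batch losses are made-up numbers, not results from this notebook), the same rule can be checked against the ordinary average:

```python
# Running mean of per-batch losses, using the same update rule as Engine.
# The batch losses below are made-up numbers for illustration only.
batch_losses = [0.9, 0.7, 0.4, 0.6]

running = 0.0
for i, batch_loss in enumerate(batch_losses):
    # After this update, `running` equals the mean of batch_losses[: i + 1].
    running += (batch_loss - running) / (i + 1)

direct = sum(batch_losses) / len(batch_losses)
print(running, direct)
```

Note that every batch gets equal weight in this mean, so a smaller final batch counts as much as a full one.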

Some config and setup prior to training. For our dataset, we use MNIST, which we fetch from OpenML using scikit-learn's fetch_openml.

# Config
RANDOM_STATE = 42
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
EPOCHS = 100
PATIENCE = 5
INPUT_SIZE = 784
NUM_CLASSES = 10

# Fetch data as NumPy arrays (as_frame=False) so that .reshape works
MNIST = fetch_openml("mnist_784", as_frame=False)
X = MNIST['data'].reshape(-1, 28, 28)
y = MNIST['target'].astype(int)

# Create folds
cv = model_selection.StratifiedKFold(n_splits=5)
trn_, val_ = next(iter(cv.split(X=X, y=y)))

# Get train and valid data loaders
train_dataset = MNISTDataset(X[trn_, :], y[trn_], transform=transforms.ToTensor())
valid_dataset = MNISTDataset(X[val_, :], y[val_], transform=transforms.ToTensor())

Intermediate values

Finally, we set up the study instance and its objective function. Note that the search space is dynamically constructed depending on the number of layers (i.e. an earlier suggestion for a hyperparameter). During training, we perform early stopping on validation loss. If no new minimum validation loss is found for 5 consecutive epochs, then training stops and the minimum validation loss is returned as the objective 3.
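
The patience counter behind this early stopping rule can be sketched in isolation. The helper name `early_stop_min` and the loss values are hypothetical and only illustrate the logic; they are not part of the training code:

```python
def early_stop_min(losses, patience=5):
    """Return (best_loss, epochs_run) for a sequence of validation losses.

    Hypothetical helper: the counter resets whenever a new minimum is
    found and training stops once it reaches zero.
    """
    best = float("inf")
    remaining = patience
    epochs_run = 0
    for loss in losses:
        epochs_run += 1
        if loss < best:
            best = loss
            remaining = patience
        else:
            remaining -= 1
            if remaining == 0:
                break
    return best, epochs_run


# Made-up losses: no new minimum appears after the third epoch.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]
best, epochs_run = early_stop_min(losses, patience=5)
print(best, epochs_run)
```

In this made-up sequence the last value (0.6) is never reached, which is the trade-off of early stopping: a later improvement is given up in exchange for a shorter trial.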

Computing intermediate values allows us to prune unpromising trials to conserve resources. The default pruner in Optuna is optuna.pruners.MedianPruner, which prunes a trial if its best intermediate result as of the current step (e.g. the current best validation loss) is worse than the median of the intermediate results of previous trials at the same step. Hence, the best intermediate result of a pruned trial is worse than the result of at least half of the previous trials at that step. In our case, a trial is pruned if its minimum validation loss does not improve quickly enough. Of course, the validation loss could still descend rapidly at later steps, but the median pruner does not bet on this happening.
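
To make the median rule concrete, here is a simplified sketch for a minimization problem. It follows the description above and is not Optuna's actual implementation; the function name `should_prune_median` and all loss values are hypothetical:

```python
from statistics import median


def should_prune_median(current_values, previous_trials, step):
    """Simplified median rule for minimization (not Optuna's own code).

    current_values: intermediate values reported so far by the running
        trial, indexed by step.
    previous_trials: one list of intermediate values per earlier trial.
    """
    # Only earlier trials that reached this step can be compared.
    others = [values[step] for values in previous_trials if len(values) > step]
    if not others:
        return False
    best_so_far = min(current_values[: step + 1])
    return best_so_far > median(others)


# Made-up validation losses of three earlier trials.
previous = [
    [0.9, 0.5, 0.3],
    [1.0, 0.6, 0.4],
    [1.2, 0.9, 0.8],
]

slow_trial = should_prune_median([1.1, 0.95], previous, step=1)
fast_trial = should_prune_median([1.1, 0.55], previous, step=1)
print(slow_trial, fast_trial)
```

At step 1 the median of the earlier trials is 0.6, so the trial whose best loss is 0.95 is pruned while the one that reached 0.55 continues.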

def define_model(trial):
  
    # Optimize the # of layers, hidden units and dropout ratio in each layer.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    out_features = []
    drop_rate = trial.suggest_float('dropout_rate', 0.2, 0.5)
    for i in range(n_layers):
        out_features.append(trial.suggest_int("n_units_l{}".format(i), 4, 128))

    return MLPClassifier(INPUT_SIZE, NUM_CLASSES, n_layers, out_features, drop_rate)


def objective(trial):

    model = define_model(trial).to(DEVICE)
    batch_size = trial.suggest_int('batch_size', 8, 512, log=True)
    learning_rate = trial.suggest_float('lr', 1e-5, 1e-1, log=True)  # replaces the deprecated suggest_loguniform
    weight_decay = trial.suggest_float('weight_decay', 0.0, 0.5)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)
    engine = Engine(model, DEVICE, optimizer)

    # Init. dataloaders
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=False)
    
    # Run training
    best_loss = np.inf
    patience = PATIENCE
    for epoch in tqdm(range(EPOCHS), total=EPOCHS, leave=False):

        # Train and validation step
        train_loss = engine.train(train_loader)
        valid_loss, valid_acc = engine.eval(valid_loader)

        # Reduce learning rate
        if scheduler is not None:
            scheduler.step(valid_loss)
            
        # Early stopping
        if valid_loss < best_loss:
            best_loss = valid_loss
            patience = PATIENCE
        else:
            patience -= 1
            if patience == 0:
                break
    
        # Pruning unpromising trials
        trial.report(valid_loss, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return best_loss

# Create and run optimization problem 
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=60)
[I 2021-09-23 10:38:02,389] A new study created in memory with name: no-name-15109c65-2b02-4b60-a01f-68019444fb10
[I 2021-09-23 10:59:05,092] Trial 0 finished with value: 0.159926038460392 and parameters: {'n_layers': 3, 'dropout_rate': 0.4237878293578906, 'n_units_l0': 105, 'n_units_l1': 12, 'n_units_l2': 57, 'batch_size': 11, 'lr': 0.07838898783374042, 'weight_decay': 0.014310556165731292}. Best is trial 0 with value: 0.159926038460392.
[I 2021-09-23 10:59:53,080] Trial 1 finished with value: 2.3018700839246367 and parameters: {'n_layers': 1, 'dropout_rate': 0.4940082698360224, 'n_units_l0': 4, 'batch_size': 17, 'lr': 0.021487602914889235, 'weight_decay': 0.3995315957064116}. Best is trial 0 with value: 0.159926038460392.
[I 2021-09-23 11:01:36,372] Trial 2 finished with value: 0.8710345980283376 and parameters: {'n_layers': 1, 'dropout_rate': 0.2939541603566087, 'n_units_l0': 7, 'batch_size': 387, 'lr': 0.00010912886075846094, 'weight_decay': 0.017634974179537355}. Best is trial 0 with value: 0.159926038460392.
[I 2021-09-23 11:03:45,439] Trial 3 finished with value: 0.3005696608595654 and parameters: {'n_layers': 3, 'dropout_rate': 0.3961957910723303, 'n_units_l0': 80, 'n_units_l1': 74, 'n_units_l2': 36, 'batch_size': 96, 'lr': 0.0026373954379683884, 'weight_decay': 0.24822915712352522}. Best is trial 0 with value: 0.159926038460392.
[I 2021-09-23 11:08:09,224] Trial 4 finished with value: 0.08793183607278308 and parameters: {'n_layers': 2, 'dropout_rate': 0.35574022199146005, 'n_units_l0': 105, 'n_units_l1': 76, 'batch_size': 46, 'lr': 0.0005830250147792009, 'weight_decay': 0.00454269961512882}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:11:13,412] Trial 5 finished with value: 0.17019883443939346 and parameters: {'n_layers': 2, 'dropout_rate': 0.23772345591693692, 'n_units_l0': 68, 'n_units_l1': 118, 'batch_size': 31, 'lr': 0.0010257091866609097, 'weight_decay': 0.25520616268833335}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:11:15,334] Trial 6 pruned. 
[I 2021-09-23 11:11:31,450] Trial 7 pruned. 
[I 2021-09-23 11:11:33,638] Trial 8 pruned. 
[I 2021-09-23 11:11:35,084] Trial 9 pruned. 
[I 2021-09-23 11:12:06,172] Trial 10 pruned. 
[I 2021-09-23 11:12:30,781] Trial 11 pruned. 
[I 2021-09-23 11:12:41,090] Trial 12 pruned. 
[I 2021-09-23 11:14:39,102] Trial 13 finished with value: 0.15866707545490213 and parameters: {'n_layers': 2, 'dropout_rate': 0.32028340139369776, 'n_units_l0': 46, 'n_units_l1': 84, 'batch_size': 54, 'lr': 0.00036889687117680296, 'weight_decay': 0.14755454483289712}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:14:42,890] Trial 14 pruned. 
[I 2021-09-23 11:14:46,241] Trial 15 pruned. 
[I 2021-09-23 11:14:51,457] Trial 16 pruned. 
[I 2021-09-23 11:16:10,227] Trial 17 finished with value: 0.10545255993435412 and parameters: {'n_layers': 2, 'dropout_rate': 0.28433430180887476, 'n_units_l0': 86, 'n_units_l1': 93, 'batch_size': 173, 'lr': 0.0008782486158306226, 'weight_decay': 0.10255248054833463}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:17:34,185] Trial 18 finished with value: 0.09672328307022972 and parameters: {'n_layers': 2, 'dropout_rate': 0.2810583620463708, 'n_units_l0': 87, 'n_units_l1': 103, 'batch_size': 226, 'lr': 0.0013952324252238192, 'weight_decay': 0.07617183386981775}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:17:43,721] Trial 19 pruned. 
[I 2021-09-23 11:17:44,968] Trial 20 pruned. 
[I 2021-09-23 11:18:44,154] Trial 21 finished with value: 0.103310143109411 and parameters: {'n_layers': 2, 'dropout_rate': 0.28231541834744717, 'n_units_l0': 86, 'n_units_l1': 99, 'batch_size': 177, 'lr': 0.001441251731465695, 'weight_decay': 0.08705490602314442}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:20:01,498] Trial 22 finished with value: 0.09066066514976596 and parameters: {'n_layers': 2, 'dropout_rate': 0.2773342812914818, 'n_units_l0': 96, 'n_units_l1': 125, 'batch_size': 262, 'lr': 0.0013605459441012556, 'weight_decay': 0.05281219164907667}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:20:02,752] Trial 23 pruned. 
[I 2021-09-23 11:20:07,399] Trial 24 pruned. 
[I 2021-09-23 11:20:24,401] Trial 25 pruned. 
[I 2021-09-23 11:20:25,692] Trial 26 pruned. 
[I 2021-09-23 11:20:27,766] Trial 27 pruned. 
[I 2021-09-23 11:20:28,949] Trial 28 pruned. 
[I 2021-09-23 11:20:30,214] Trial 29 pruned. 
[I 2021-09-23 11:20:37,530] Trial 30 pruned. 
[I 2021-09-23 11:21:45,135] Trial 31 finished with value: 0.10255725085735322 and parameters: {'n_layers': 2, 'dropout_rate': 0.28213738835019486, 'n_units_l0': 81, 'n_units_l1': 100, 'batch_size': 188, 'lr': 0.0010823376577192374, 'weight_decay': 0.08337185022784563}. Best is trial 4 with value: 0.08793183607278308.
[I 2021-09-23 11:21:51,880] Trial 32 pruned. 
[I 2021-09-23 11:22:17,823] Trial 33 pruned. 
[I 2021-09-23 11:22:18,941] Trial 34 pruned. 
[I 2021-09-23 11:23:57,770] Trial 35 finished with value: 0.07131263853050765 and parameters: {'n_layers': 1, 'dropout_rate': 0.25003371367463706, 'n_units_l0': 95, 'batch_size': 141, 'lr': 0.0005237534419879276, 'weight_decay': 0.07353638770877091}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:25:37,628] Trial 36 finished with value: 0.08455202862944294 and parameters: {'n_layers': 1, 'dropout_rate': 0.24280168011286699, 'n_units_l0': 95, 'batch_size': 122, 'lr': 0.0005010708623607415, 'weight_decay': 0.1824635322987805}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:27:02,240] Trial 37 finished with value: 0.08483004270364407 and parameters: {'n_layers': 1, 'dropout_rate': 0.22159664064331255, 'n_units_l0': 98, 'batch_size': 122, 'lr': 0.0004940711510589418, 'weight_decay': 0.1741633832571235}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:27:03,936] Trial 38 pruned. 
[I 2021-09-23 11:28:40,523] Trial 39 finished with value: 0.09355619992511183 and parameters: {'n_layers': 1, 'dropout_rate': 0.2042784926760532, 'n_units_l0': 112, 'batch_size': 46, 'lr': 0.00022698429759967267, 'weight_decay': 0.2794574787467307}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:29:18,054] Trial 40 pruned. 
[I 2021-09-23 11:29:49,351] Trial 41 pruned. 
[I 2021-09-23 11:31:15,710] Trial 42 finished with value: 0.1020486264151859 and parameters: {'n_layers': 1, 'dropout_rate': 0.23164804977568673, 'n_units_l0': 93, 'batch_size': 77, 'lr': 0.000285671272172161, 'weight_decay': 0.32568189877238407}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:31:17,269] Trial 43 pruned. 
[I 2021-09-23 11:31:34,596] Trial 44 pruned. 
[I 2021-09-23 11:31:42,364] Trial 45 pruned. 
[I 2021-09-23 11:31:43,935] Trial 46 pruned. 
[I 2021-09-23 11:32:27,962] Trial 47 pruned. 
[I 2021-09-23 11:32:35,345] Trial 48 pruned. 
[I 2021-09-23 11:32:37,533] Trial 49 pruned. 
[I 2021-09-23 11:32:42,969] Trial 50 pruned. 
[I 2021-09-23 11:33:43,032] Trial 51 pruned. 
[I 2021-09-23 11:34:08,690] Trial 52 pruned. 
[I 2021-09-23 11:34:22,864] Trial 53 pruned. 
[I 2021-09-23 11:36:40,729] Trial 54 finished with value: 0.09744334753924168 and parameters: {'n_layers': 1, 'dropout_rate': 0.21622462411934498, 'n_units_l0': 89, 'batch_size': 31, 'lr': 0.0001329248089359777, 'weight_decay': 0.2971222095134244}. Best is trial 35 with value: 0.07131263853050765.
[I 2021-09-23 11:36:43,995] Trial 55 pruned. 
[I 2021-09-23 11:37:28,764] Trial 56 pruned. 
[I 2021-09-23 11:40:07,780] Trial 57 finished with value: 0.0686554341496194 and parameters: {'n_layers': 1, 'dropout_rate': 0.23290628211542697, 'n_units_l0': 96, 'batch_size': 64, 'lr': 0.0006717603763568699, 'weight_decay': 0.04716562652949421}. Best is trial 57 with value: 0.0686554341496194.
[I 2021-09-23 11:40:09,440] Trial 58 pruned. 
[I 2021-09-23 11:40:12,123] Trial 59 pruned. 
from optuna.trial import TrialState

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials:\t", len(study.trials))
print("  Number of pruned trials:\t", len(pruned_trials))
print("  Number of complete trials:\t", len(complete_trials))

print("\nBest trial:")
trial = study.best_trial

print("  Value: ", trial.value)
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Study statistics: 
  Number of finished trials:	 60
  Number of pruned trials:	 41
  Number of complete trials:	 19

Best trial:
  Value:  0.0686554341496194
  Params: 
    n_layers: 1
    dropout_rate: 0.23290628211542697
    n_units_l0: 96
    batch_size: 64
    lr: 0.0006717603763568699
    weight_decay: 0.04716562652949421

The trials below either stop early (gradient descent loses momentum) or get pruned (unlikely to improve even if gradient descent continues). Note that pruning can start at Trial 5 at the earliest; in the log above, the first pruned trial is Trial 6. This is controlled by the n_startup_trials parameter of the pruner, which defaults to 5: pruning is disabled until 5 trials have finished in the same study. This is so that the pruner obtains enough information about the behavior of the gradient descent optimizer before starting to prune.

plot_html(optuna.visualization.plot_intermediate_values(study))
plot_html(optuna.visualization.plot_optimization_history(study))

Hyperparameter interactions

We look at which combinations of hyperparameters work well using the parallel coordinate plot. Note that something odd is going on here: trials with n_layers=1 have coordinates on axes where they should have no values, e.g. n_units_l1 and n_units_l2. This is a known issue for parallel plots, e.g. #1809. Ideally, the plotter would skip lines for trials whose dynamically constructed parameters are missing (NaN). Moreover, trials with NaN values are excluded from the parameter importance computation, which limits its usefulness.

plot_html(optuna.visualization.plot_parallel_coordinate(study))
study.trials_dataframe().head()
   number     value              datetime_start           datetime_complete                duration  params_batch_size  params_dropout_rate  params_lr  params_n_layers  params_n_units_l0  params_n_units_l1  params_n_units_l2  params_weight_decay     state
0       0  0.159926  2021-09-23 10:38:02.391944  2021-09-23 10:59:05.091262  0 days 00:21:02.699318                 11             0.423788   0.078389                3                105               12.0               57.0             0.014311  COMPLETE
1       1  2.301870  2021-09-23 10:59:05.096371  2021-09-23 10:59:53.079261  0 days 00:00:47.982890                 17             0.494008   0.021488                1                  4                NaN                NaN             0.399532  COMPLETE
2       2  0.871035  2021-09-23 10:59:53.084333  2021-09-23 11:01:36.371685  0 days 00:01:43.287352                387             0.293954   0.000109                1                  7                NaN                NaN             0.017635  COMPLETE
3       3  0.300570  2021-09-23 11:01:36.374203  2021-09-23 11:03:45.438474  0 days 00:02:09.064271                 96             0.396196   0.002637                3                 80               74.0               36.0             0.248229  COMPLETE
4       4  0.087932  2021-09-23 11:03:45.440817  2021-09-23 11:08:09.224148  0 days 00:04:23.783331                 46             0.355740   0.000583                2                105               76.0                NaN             0.004543  COMPLETE
study.trials_dataframe().query("state=='COMPLETE'").params_n_layers.value_counts()
1    9
2    8
3    2
Name: params_n_layers, dtype: int64
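
The NaN entries in the dataframe above come from the conditional search space: a trial only has an n_units_l{i} parameter for the layers it actually contains. A small sketch without Optuna, using the layer-related parameters of trials 1 to 4 copied from the log (the variable names are illustrative):

```python
# Layer-related parameters of trials 1-4, copied from the log above.
trial_params = [
    {"n_layers": 1, "n_units_l0": 4},
    {"n_layers": 1, "n_units_l0": 7},
    {"n_layers": 3, "n_units_l0": 80, "n_units_l1": 74, "n_units_l2": 36},
    {"n_layers": 2, "n_units_l0": 105, "n_units_l1": 76},
]

# Tabulating over the union of keys leaves gaps (None here, NaN in pandas).
columns = sorted({key for params in trial_params for key in params})
table = [[params.get(col) for col in columns] for params in trial_params]

# Grouping by n_layers first gives every row in a group the same keys.
groups = {}
for params in trial_params:
    groups.setdefault(params["n_layers"], []).append(params)

for n_layers, rows in sorted(groups.items()):
    key_sets = {frozenset(row) for row in rows}
    print(n_layers, "-> consistent keys:", len(key_sets) == 1)
```

Within each group no value is missing, so plots and statistics computed per group do not have to deal with gaps.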

Instead, we can look at each subset of trials for different values of n_layers. The resulting trials have no NaN parameters since the parameters are sampled after a value for n_layers has been suggested. It looks like n_layers=1 works best.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Isolate a study for each value of n_layers
studies = [optuna.create_study() for j in range(3)]
for j in range(3):
    studies[j].add_trials([t for t in study.trials if t.params['n_layers'] == j+1])
    fig = optuna.visualization.plot_parallel_coordinate(studies[j])
    plot_html(fig)
[I 2021-09-23 11:45:13,690] A new study created in memory with name: no-name-b8d5c1f9-375a-4aa8-919e-4ef6fdb7492c
[I 2021-09-23 11:45:13,694] A new study created in memory with name: no-name-10bc04ca-72cd-494a-804e-99af840e0185
[I 2021-09-23 11:45:13,697] A new study created in memory with name: no-name-7f4d429f-e6b6-4624-a72f-8cede19884fa
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: ExperimentalWarning:

add_trials is experimental (supported from v2.5.0). The interface can change in the future.

/usr/local/lib/python3.7/dist-packages/optuna/study/study.py:969: ExperimentalWarning:

add_trial is experimental (supported from v2.0.0). The interface can change in the future.

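From the following contour plot, we can see which regions of the search space work well. Judging by the best completed trials in the log above, a small to moderate batch size (roughly 46 to 141) is generally good, together with values of dropout, learning rate, and weight decay toward the low end of their search ranges, and only a single hidden layer. From the above parallel plot, a hidden layer of size around 90 looks good.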

fig = optuna.visualization.plot_contour(study, params=['batch_size', 'lr', 'n_layers', 'weight_decay', 'dropout_rate'])
fig.update_layout(autosize=False, width=1200, height=1200)
plot_html(fig)

Appendix: Hyperparameters of commonly used models


../_images/hyp.png

Table from p. 184 of [Tha20]. RS\(^*\) implies random search should be better.


1

Like all applied machine learning solutions.

2

See Optuna dashboard which displays the same plots that are updated in real-time.

3

In practice, we save the best model parameters at this point.